
Applications in Natural Language Processing

FIGURE 5.4

The overview of the algorithm proposed in [118].

dense matrix of multi-head self-attention is treated as a group. As a result, there will be 12 groups since there are 12 heads. Then, in each group, they bucket sequential output neurons together as sub-groups, e.g., every N output neurons as one sub-group. Consequently, there are 12 × 64/N sub-groups in total (the hidden dimension of each head in BERT-base is 768/12 = 64). Now, each sub-group has its own quantization range. Fig. 5.6 presents an illustration. Here Nh

FIGURE 5.5

Top eigenvalue distributions for different encoder layers on various datasets, including SST-2, MNLI, CoNLL-03, and SQuAD. The middle layers generally have higher mean values and larger variance than the others. The last three layers have the smallest variance and mean values among all layers.

FIGURE 5.6

The overview of the group-wise quantization method proposed in [209]. Here the Nh (number of heads) value matrices Wv are concatenated together, resulting in a 3-d tensor. The same color denotes the same group with a shared quantization range.
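The grouping scheme described above can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes a dense attention weight matrix of shape (768, 768) where each head owns a contiguous slice of 64 output neurons, and it computes one (min, max) quantization range per sub-group of N neurons. The function name and parameters are hypothetical.

```python
import numpy as np

def subgroup_quant_ranges(W, num_heads=12, N=16):
    """Compute per-sub-group quantization ranges for a dense
    multi-head self-attention weight matrix.

    Each head is one group; every N sequential output neurons
    inside a head form one sub-group with its own range.
    """
    hidden_dim = W.shape[0]
    head_dim = hidden_dim // num_heads           # 768 // 12 = 64 for BERT-base
    ranges = []
    for h in range(num_heads):                   # one group per head
        head_slice = W[h * head_dim:(h + 1) * head_dim]
        for s in range(0, head_dim, N):          # head_dim // N sub-groups per head
            sub = head_slice[s:s + N]
            ranges.append((sub.min(), sub.max()))
    return ranges

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
ranges = subgroup_quant_ranges(W, num_heads=12, N=16)
print(len(ranges))  # 12 * (64 / 16) = 48 sub-groups, matching 12 x 64/N
```

With N = 16 this yields 48 independent quantization ranges instead of a single range for the whole matrix, which is the point of the finer granularity: outlier neurons in one sub-group no longer inflate the range used by the rest.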